Comparative Learning: A Sample Complexity Theory for Two Hypothesis Classes
In many learning theory problems, a central role is played by a hypothesis class: we might assume that the data is labeled according to a hypothesis in the class (usually referred to as the realizable setting), or we might evaluate the learned model by comparing it with the best hypothesis in the class (the agnostic setting). Taking a step beyond these classic setups that involve only a single hypothesis class, we study a variety of problems that involve two hypothesis classes simultaneously.
We introduce comparative learning as a combination of the realizable and agnostic settings in PAC learning: given two binary hypothesis classes S and B, we assume that the data is labeled according to a hypothesis in the source class S and require the learned model to achieve an accuracy comparable to the best hypothesis in the benchmark class B. Even when both S and B have infinite VC dimensions, comparative learning can still have a small sample complexity. We show that the sample complexity of comparative learning is characterized by the mutual VC dimension VC(S,B) which we define to be the maximum size of a subset shattered by both S and B. We also show a similar result in the online setting, where we give a regret characterization in terms of the analogous mutual Littlestone dimension Ldim(S,B). These results also hold for partial hypotheses.
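As a concrete illustration of the definition above, the mutual VC dimension of two tiny finite classes can be computed by brute force, with each hypothesis represented as a tuple of labels over a small domain. This is our own sanity-check sketch (the helper names `shatters` and `mutual_vc` are ours, not from the paper), and it is exponential time, so it only makes sense for toy classes:

```python
from itertools import combinations

def shatters(hypotheses, subset):
    """Does this class (a collection of label tuples over the domain)
    shatter the given tuple of domain indices? True iff the class
    realizes all 2^|subset| label patterns on those points."""
    patterns = {tuple(h[i] for i in subset) for h in hypotheses}
    return len(patterns) == 2 ** len(subset)

def mutual_vc(S, B, domain_size):
    """Brute-force VC(S, B): the maximum size of a domain subset
    shattered by BOTH classes. Searches from largest subsets down."""
    for r in range(domain_size, 0, -1):
        for subset in combinations(range(domain_size), r):
            if shatters(S, subset) and shatters(B, subset):
                return r
    return 0
```

For example, on a 3-point domain, taking S to be all 8 labelings (VC dimension 3) and B the 4 threshold labelings (0,0,0), (1,0,0), (1,1,0), (1,1,1) (VC dimension 1), the mutual dimension VC(S, B) comes out to 1, consistent with VC(S, B) being at most the smaller of the two individual dimensions.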
We additionally show that the insights necessary to characterize the sample complexity of comparative learning can be applied to other tasks involving two hypothesis classes. In particular, we characterize the sample complexity of realizable multiaccuracy and multicalibration using the mutual fat-shattering dimension, an analogue of the mutual VC dimension for real-valued hypotheses. This not only solves an open problem proposed by Hu, Peale, and Reingold (2022), but also leads to independently interesting results extending classic ones about regression, boosting, and covering numbers to our two-hypothesis-class setting.
A Unifying Theory of Distance from Calibration
We study the fundamental question of how to define and measure the distance
from calibration for probabilistic predictors. While the notion of perfect
calibration is well-understood, there is no consensus on how to quantify the
distance from perfect calibration. Numerous calibration measures have been
proposed in the literature, but it is unclear how they compare to each other,
and many popular measures such as Expected Calibration Error (ECE) fail to
satisfy basic properties like continuity.
We present a rigorous framework for analyzing calibration measures, inspired
by the literature on property testing. We propose a ground-truth notion of
distance from calibration: the distance to the nearest perfectly
calibrated predictor. We define a consistent calibration measure as one that is
polynomially related to this distance. Applying our framework, we identify
three calibration measures that are consistent and can be estimated
efficiently: smooth calibration, interval calibration, and Laplace kernel
calibration. The former two give quadratic approximations to the ground truth
distance, which we show is information-theoretically optimal in a natural model
for measuring calibration which we term the prediction-only access model. Our
work thus establishes fundamental lower and upper bounds on measuring the
distance to calibration, and also provides theoretical justification for
preferring certain metrics (like Laplace kernel calibration) in practice.
Comment: In STOC 202
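The continuity failure of ECE mentioned above is easy to exhibit. The sketch below is our own minimal rendering of the standard equal-width-binned ECE, not code from the paper:

```python
import numpy as np

def binned_ece(preds, labels, n_bins=10):
    """Binned Expected Calibration Error: sum over bins of
    (fraction of points in bin) * |mean prediction - mean label|."""
    bins = np.clip((preds * n_bins).astype(int), 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            ece += mask.mean() * abs(preds[mask].mean() - labels[mask].mean())
    return ece
```

With predictions 0.499 and 0.501 on labels 1 and 0, the two points fall in different bins and the measure is about 0.5; nudging the predictions to 0.5001 and 0.5002 puts them in one bin and the measure collapses to nearly 0. An arbitrarily small perturbation of the predictor thus moves binned ECE by a constant, which is exactly the kind of discontinuity the consistency framework rules out.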
When Does Optimizing a Proper Loss Yield Calibration?
Optimizing proper loss functions is popularly believed to yield predictors
with good calibration properties; the intuition being that for such losses, the
global optimum is to predict the ground-truth probabilities, which is indeed
calibrated. However, typical machine learning models are trained to
approximately minimize loss over restricted families of predictors, that are
unlikely to contain the ground truth. Under what circumstances does optimizing
proper loss over a restricted family yield calibrated models? What precise
calibration guarantees does it give? In this work, we provide a rigorous answer
to these questions. We replace the global optimality with a local optimality
condition stipulating that the (proper) loss of the predictor cannot be reduced
much by post-processing its predictions with a certain family of Lipschitz
functions. We show that any predictor with this local optimality satisfies
smooth calibration as defined in Kakade-Foster (2008), Błasiok et al.
(2023). Local optimality is plausibly satisfied by well-trained DNNs, which
suggests an explanation for why they are calibrated from proper loss
minimization alone. Finally, we show that the connection between local
optimality and calibration error goes both ways: nearly calibrated predictors
are also nearly locally optimal.
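The local optimality condition can be probed numerically. The sketch below is our own construction (not the paper's), using only constant shifts, the simplest 1-Lipschitz post-processing family: it measures how much squared loss (a proper loss) the best shift recovers. A calibrated predictor gains nothing; a systematically biased one gains a constant, certifying that it is not locally optimal:

```python
import numpy as np

def sq_loss(p, y):
    """Squared loss, a proper loss for probability predictions."""
    return np.mean((p - y) ** 2)

def best_shift_gain(preds, labels, etas=np.linspace(-0.2, 0.2, 81)):
    """Loss reduction achieved by the best post-processing of the form
    p -> clip(p + eta, 0, 1), i.e. a constant (1-Lipschitz) shift.
    A large gain shows the predictor is NOT locally optimal."""
    base = sq_loss(preds, labels)
    best = min(sq_loss(np.clip(preds + e, 0, 1), labels) for e in etas)
    return base - best
```

On labels that are half 0s and half 1s, a predictor stuck at 0.8 gives up a constant amount of loss to the best shift, while the calibrated predictor at 0.5 gives up none; the richer Lipschitz families in the paper generalize this test.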
Generative Models of Huge Objects
This work initiates the systematic study of explicit distributions that are indistinguishable from a single exponential-size combinatorial object. In doing so, we extend the work of Goldreich, Goldwasser, and Nussboim (SICOMP 2010), which focused on the implementation of huge objects that are indistinguishable from the uniform distribution while satisfying some global properties (a requirement they coined truthfulness). Indistinguishability from a single object is motivated by the study of generative models in learning theory and regularity lemmas in graph theory. Problems that are well understood in the setting of pseudorandomness present significant challenges and are at times impossible to solve when considering generative models of huge objects.
We demonstrate the versatility of this study by providing a learning algorithm for huge indistinguishable objects in several natural settings, including: dense functions and graphs with a truthfulness requirement on the number of ones in the function or of edges in the graph, and a version of the weak regularity lemma for sparse graphs that satisfy some global properties. These and other results generalize basic pseudorandom objects as well as notions introduced in algorithmic fairness. The results rely on notions and techniques from a variety of areas, including learning theory, complexity theory, cryptography, and game theory.
Omnipredictors for Constrained Optimization
The notion of omnipredictors (Gopalan, Kalai, Reingold, Sharan, and Wieder,
ITCS 2021) suggested a new paradigm for loss minimization. Rather than
learning a predictor based on a known loss function, omnipredictors can easily
be post-processed to minimize any one of a rich family of loss functions,
compared with the loss of hypotheses in a given class. It has been shown
that such omnipredictors exist and are implied (for all convex and Lipschitz
loss functions) by the notion of multicalibration from the algorithmic fairness
literature. In this paper, we introduce omnipredictors for constrained
optimization and study their complexity and implications. The notion that we
introduce allows the learner to be unaware of the loss function that will be
later assigned as well as the constraints that will be later imposed, as long
as the subpopulations that are used to define these constraints are known. We
show how to obtain omnipredictors for constrained optimization problems,
relying on appropriate variants of multicalibration. We also investigate the
implications of this notion when the constraints used are so-called group
fairness notions.
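The post-processing step that omniprediction enables is easy to state concretely for a single (unconstrained) prediction. The sketch below is our own minimal rendering, not code from the paper: given a probability prediction p and a loss supplied only at decision time, it returns the action minimizing expected loss under p:

```python
import numpy as np

def post_process(p, loss, actions):
    """Given a probability prediction p for a binary outcome and an
    arbitrary loss(action, y), pick the action minimizing expected loss:
        argmin_a  p * loss(a, 1) + (1 - p) * loss(a, 0).
    The predictor is learned once; the loss arrives afterwards."""
    exp_losses = [p * loss(a, 1) + (1 - p) * loss(a, 0) for a in actions]
    return actions[int(np.argmin(exp_losses))]
```

With squared loss the chosen action is the prediction itself; with absolute loss it is the nearer of 0 and 1, recovering the usual threshold rule. The same predictor serves both losses, which is the point of the paradigm; the paper's contribution is extending this to later-imposed constraints over known subpopulations.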
Simple, Scalable and Effective Clustering via One-Dimensional Projections
Clustering is a fundamental problem in unsupervised machine learning with
many applications in data analysis. Popular clustering algorithms such as
Lloyd's algorithm and k-means++ can take Ω(ndk) time when clustering n
points in a d-dimensional space (represented by an n × d matrix
X) into k clusters. In applications with moderate to large k, the
multiplicative k factor can become very expensive. We introduce a simple
randomized clustering algorithm that provably runs in expected time
O(nnz(X) + n log n) for arbitrary k. Here nnz(X) is the
total number of non-zero entries in the input dataset X, which is upper
bounded by nd and can be significantly smaller for sparse datasets. We prove
that our algorithm achieves approximation ratio Õ(k^4) on
any input dataset for the k-means objective. We also believe that our
theoretical analysis is of independent interest, as we show that the
approximation ratio of a k-means algorithm is approximately preserved under a
class of projections and that k-means++ seeding can be implemented in
expected O(n log n) time in one dimension. Finally, we show experimentally
that our clustering algorithm gives a new tradeoff between running time and
cluster quality compared to previous state-of-the-art methods for these tasks.
Comment: 41 pages, 6 figures, to appear in NeurIPS 202
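The one-dimensional projection at the heart of this approach is simple to sketch. The following is our own simplified illustration of the projection-plus-seeding idea, not the authors' full algorithm or its guarantees (in particular, it omits everything needed for the stated running-time and approximation bounds):

```python
import numpy as np

def project_and_seed(X, k, seed=0):
    """Project an n x d dataset onto one random unit direction, then run
    k-means++-style D^2 seeding on the resulting one-dimensional values.
    Illustration only: assumes the projected values are not all equal."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(X.shape[1])
    v /= np.linalg.norm(v)            # random unit direction
    z = X @ v                         # 1-D projection; costs O(nnz(X))
    centers = [z[rng.integers(len(z))]]
    for _ in range(k - 1):
        # D^2 seeding: sample proportionally to squared 1-D distance
        d2 = np.min([(z - c) ** 2 for c in centers], axis=0)
        centers.append(z[rng.choice(len(z), p=d2 / d2.sum())])
    return v, np.array(centers)
```

Because the seeding happens on scalars, the distance computations are cheap, which is the intuition behind implementing k-means++ seeding quickly in one dimension; the paper supplies the analysis showing how much cluster quality survives the projection.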